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Abstract 

We present an overview of current research on artificial neural networks, emphasizing a statistical per¬ 
spective. We view neural networks as parameterized graphs that make probabilistic assumptions about 
data, and view learning algorithms as methods for finding parameter values that look probable in the light 
of the data. We discuss basic issues in representation and learning, and treat some of the practical issues 
that arise in fitting networks to data. We also discuss links between neural networks and the general 
formalism of graphical models. 


Copyright © Massachusetts Institute of Technology, 1996 


In press: A. Tucker, (Ed.), CRC Handbook of Computer Science, CRC Press, Boca Raton, FL. This report describes research 
done at the Dept, of Brain and Cognitive Sciences, the Center for Biological and Computational Learning, and the Artificial 
Intelligence Laboratory of the Massachusetts Institute of Technology. Support for CBCL is provided in part by a grant from the 
NSF (ASC-9217041). Support for the laboratory’s artificial intelligence research is provided in part by the Advanced Research 
Projects Agency of the Dept, of Defense. The authors were supported by a grant from the McDonnell-Pew Foundation, by a 
grant from Siemens Corporation, by a grant from Daimler-Benz Systems Technology Research, and by a grant from the Office 
of Naval Research. Michael I. Jordan is a NSF Presidential Young Investigator. 



1 Introduction 

Within the broad scope of the study of artificial intelli¬ 
gence, research in neural networks is characterized by a 
particular focus on pattern recognition and pattern gen¬ 
eration. Many neural network methods can be viewed as 
generalizations of classical pattern-oriented techniques in 
statistics and the engineering areas of signal processing, 
system identification and control theory. As in these 
parent disciplines, the notion of “pattern” in neural net¬ 
work research is essentially probabilistic and numerical. 
Neural network methods have had their greatest impact 
in problems where statistical issues dominate and where 
data are easily obtained. 

A neural network is first and foremost a graph, with 
patterns represented in terms of numerical values at¬ 
tached to the nodes of the graph, and transformations 
between patterns achieved via simple message-passing 
algorithms. Many neural network architectures, how¬ 
ever, are also statistical processors, characterized by 
making particular probabilistic assumptions about data. 
As we will see, this conjunction of graphical algorithms 
and probability theory is not unique to neural networks, 
but characterizes a wider family of probabilistic systems 
in the form of chains, trees, and networks that are cur¬ 
rently being studied throughout AI [Spiegelhalter, et ah, 
1993], 

Neural networks have found a wide range of applica¬ 
tions, the majority of which are associated with problems 
in pattern recognition and control theory. In this con¬ 
text, neural networks can best be viewed as a class of al¬ 
gorithms for statistical modeling and prediction. Based 
on a source of training data, the aim is to produce a 
statistical model of the process from which the data are 
generated, so as to allow the best predictions to be made 
for new data. We shall find it convenient to distinguish 
three broad types of statistical modeling problem, which 
we shall call density estimation, classification and regres¬ 
sion. 

For density estimation problems (also referred to as 
unsupervised learning problems), the goal is to model 
the unconditional distribution of data described by some 
vector x. A practical example of the application of den¬ 
sity estimation involves the interpretation of X-ray im¬ 
ages (mammograms) used for breast cancer screening 
[Tarassenko, 1995]. In this case the training vectors x 
form a sample taken from normal (non-cancerous) im¬ 
ages, and a network model is used to build a representa¬ 
tion of the density p(x). When a new input vector x' is 
presented to the system, a high value for p(x') indicates 
a normal image while a low value indicates a novel input 
which might be characteristic of an abnormality. This 
is used to label regions of images which are unusual, for 
further examination by an experienced clinician. 

For classification and regression problems (often re¬ 
ferred to as supervised learning problems), we need to 
distinguish between input variables, which we again de¬ 
note by x, and target variables which we denote by the 
vector t. Classification problems require that each input 
vector x be assigned to one of C classes C\, . . . ,Cc, in 
which case the target variables represent class labels. As 
an example, consider the problem of recognizing hand¬ 


written digits [LeCun, et ah, 1989]. In this case the input 
vector would be some (pre-processed) image of the digit, 
and the network would have ten outputs, one for each 
digit, which can be used to assign input vectors to the 
appropriate class (as discussed in Section 2). 

Regression problems involve estimating the values of 
continuous variables. For example, neural networks have 
been used as part of the control system for adaptive op¬ 
tics telescopes [Sandler, et ah, 1991]. The network input 
x consists of one in-focus and one de-focused image of 
a star and the output t consists of a set of coefficients 
that describe the phase distortion due to atmospheric 
turbulence. These output values are then used to make 
real-time adjustments of the multiple mirror segments to 
cancel the atmospheric distortion. 

Classification and regression problems can also be 
viewed as special cases of density estimation. The most 
general and complete description of the data is given by 
the probability distribution function p(x, t) in the joint 
input-target space. However, the usual goal is to be able 
to make good predictions for the target variables when 
presented with new values of the inputs. In this case it 
is convenient to decompose the joint distribution in the 
form: 

p(x,t) = p(t|x)p(x) (1) 

and to consider only the conditional distribution p(t|x), 
in other words the distribution of t given the value of x. 
Thus classification and regression involve the estimation 
of conditional densities, a problem which has its own 
idiosyncracies. 

The organization of the chapter is as follows. In Sec¬ 
tion 2 we present examples of network representations 
of unconditional and conditional densities. In Section 3 
we discuss the problem of adjusting the parameters of 
these networks to fit them to data. This problem has 
a number of practical aspects, including the choice of 
optimization procedure and the method used to control 
network complexity. We then discuss a broader perspec¬ 
tive on probabilistic network models in Section 4. The 
final section presents further information and pointers to 
the literature. 

2 Representation 

In this section we describe a selection of neural network 
architectures that have been proposed as representations 
for unconditional and conditional densities. After a brief 
discussion of density estimation, we discuss classifica¬ 
tion and regression, beginning with simple models that 
illustrate the fundamental ideas and then progressing to 
more complex architectures. We focus here on represen¬ 
tational issues, postponing the problem of learning from 
data until the following section. 

2.1 Density estimation 

We begin with a brief discussion of density estimation, 
utilizing the Gaussian mixture model as an illustrative 
model. We return to more complex density estimation 
techniques later in the chapter. 

Although density estimation can be the main goal of a 
learning system, as in the diagnosis example mentioned 



in the introduction, density estimation models arise more 
often as components of the solution to a more general 
classification or regression problem. To return to Eq. 1, 
note that the joint density is composed of p(t|x), to be 
handled by classification or regression models, and p(x), 
the (unconditional) input density. There are several rea¬ 
sons for wanting to form an explicit model of the input 
density. First, real-life data sets often have missing com¬ 
ponents in the input vector. Having a model of the den¬ 
sity allows the missing components to be “filled in” in an 
intelligent way. This can be useful both for training and 
for prediction [cf. Bishop, 1995]. Second, as we see in 
Eq. 1, a model of p(x) makes possible an estimate of the 
joint probability p(x,t). Thus in turn provides us with 
the necessary information to estimate t.life: “inverse” con¬ 
ditional density p(x|t). The calculation of such inverses 
is important for applications in control and optimization. 

A general and flexible approach to density estimation 
is to treat the density as being composed of a set of M 
simpler densities. This approach involves modeling the 
observed data as a sample from a mixture density: 

M 

p(x|w) = ^7r !: p(x|*,w !: ), (2) 

8 = 1 

win iv i In 7T,; are constants known as mixing proportions , 
and the p(x|*,w,;) are the component densities , gener¬ 
ally taken to be from a simple parametric family. A 
common choice of component density is the multivari¬ 
ate Gaussian, in which case the parameters w,; are the 
means and covariance matrices of each of the compo¬ 
nents. By varying the means and covariances to place 
and orient the Gaussians appropriately, a widfe variety 
of high-dimensional, multi-modal data can be modeled. 
This approach to density estimation is essentially a prob¬ 
abilistic form of clustering. 

Gaussian mixtures have a representation as a network 
diagram as shown in Figure 1. The utility of such net¬ 
work representations will become clearer as we proceed; 
for now, it suffices to note that not only mixture models, 
but also a wide variety of other classical statistical mod¬ 
els for density estimation are representable as simple net¬ 
works with one or more layers of adaptive weights. These 
methods include principal component analysis , canonical 
correlation analysis , kernel density estimation and factor 
analysis [Anderson, 1984]. 

2.2 Linear regression and linear discriminants 

Regression models and classification models both focus 
on the conditional density p(t|x). They differ in that 
in regression the target vector t is a real-valued vector, 
whereas in classification t takes its values from a discrete 
set representing the class labels. 

The simplest probabilistic model for regression is one 
in which t is viewed as the sum of an underlying deter¬ 
ministic function /(x) and a Gaussian random variable 
e: 

t = /(x) + e. (3) 

If e has zero mean, as is commonly assumed, /(x) then 
becomes the conditional mean ff(t|x). It is this function 
that is the focus of most regression modeling. Of course, 
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Figure 1: A network representation of a Gaussian mix¬ 
ture distribution. The input pattern x is represented 
by numerical values associated with the input nodes in 
the lower level. Each link has a weight fiij, which is the 
j th component of the mean vector for the I th Gaussian. 
The I th intermediate node contains the covarianfll ma¬ 
trix Si and calculates the Gaussian conditional proba¬ 
bility p(x|*, Hi, Si). These probabilities are weighted by 
the mixing proportions 7r,; and the output node calculates 
the weighted sum p(x) = JA 7T,;p(x|*, Hi, X*i)- 


the conditional mean describes only the first moment 
of the conditional distribution, and, as we discuss in a 
later section, a good regression model will also generally 
report information about, the second moment. 

In a linear regression model the conditional mean is a 
linear function of x: ff(t|x) = TTx, for a fixed matrix W. 
Linear regression has a straightforward representation as 
a network diagram in which the j th input unit represents 
the j th component of the input vector xj, each output 
unit i takes the weighted sum of the input values, and 
the weight W{j is placed on the link between the j th input 
unit and the I th output unit. 

The conditional mean is also an important function in 
classification problems, but most of the focus in classifi¬ 
cation is on a different function known as a discriminant 
function. To see how this function arises and to relate it 
to the conditional mean, we consider a simple two-class 
problem in which the target is a simple binary scalar 
that w# now denote by t. The conditional mean E(t. |x) 
is equal to the probability that t. equals one, and this 
latter probability can be expanded via Bayes rule: 

_ p(x\t = l)p(t, = 1) 

' } P(x) 

The density p(f|x) in this equation is referred to as the 
posterior probability of the class given the input, and the 
density p(x|f) is referred to as the the class-conditional 
density. Continuing the derivation, we expand the de¬ 
nominator and (with some foresight) introduce an expo- 








nential: 
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We see that the posterior probability can be written in 
the form of the logistic function: 


V = 


1 

1 + e - -' ’ 


( 6 ) 


where z is a function of the likelihood ratio p(x\t. = 
l)/p(x|f = 0), and the prior ratio p(t = 1 )/p(t = 0). 
This is a useful representation of the posterior probabil¬ 
ity if z turns out to be simple. 

It is easily verified that if the class conditional densi¬ 
ties are multivariate Gaussians with identical covariance 
matrices, then z is a linear function of x: z = w T x + wo- 
Moreover this representation is appropriate for any dis¬ 
tribution in a broad class of densities known as the expo¬ 
nential family (which includes the Gaussian, the Poisson, 
the gamma, the binomial, and many other densities). All 
of the densities in this family can be put in the following 
form: 


g(x; 9 , <j>) = exp{(0 T x - b{9))/a{<f>) + c(x, (7) 

where 8 is the location parameter , and <fi is the scale pa¬ 
rameter. Substituting this general form in Eq. 5, where 8 
is allowed to vary between the classes and <fi is assumed to 
be constant between classes, we see that z is in all cases 
a linear function. Thus the choice of a linear-logistic 
model is rather robust. 

The geometry of the two-class problem is shown in 
Figure 2, which shows Gaussian class-conditional densi¬ 
ties, and suggests the logistic form of the posterior prob¬ 
ability. 

The function z in our analysis is an example of a dis¬ 
criminant function. In general a discriminant function is 
any function that can be used to decide on class member¬ 
ship (Duda and Hart, 1972); our analysis has produced a 
particular form of discriminant function that is an inter¬ 
mediate step in the calculation of a posterior probability. 
Note that if we set z = 0, from the form of the logistic 
function we obtain a probability of 0.5, which shows that 
z = 0 is a decision boundary between the two classes. 

The discriminant function that we found for expo¬ 
nential family densities is linear under the given con¬ 
ditions on <fi. In more general situations, in which the 
class-conditional densities are more complex than a sin¬ 
gle exponential family density, the posterior probability 
will not be well characterized by the linear-logistic form. 
Nonetheless it is still useful to retain the logistic function 
and focus on nonlinear representations for the function 
r. This is the approach taken within the neural network 
field. 

To summarize, we have identified two functions that 
are important for regression and classification, respec¬ 
tively: the conditional mean and the discriminant func¬ 
tion. These are the two functions that are of concern for 
simple linear models and, as we now discuss, for more 
complex nonlinear models as well. 



Figure 2: This shows the Gaussian class-conditional den¬ 
sities p(,c|Ci) (dashed curves) for a two-class problem 
in one dimension, together with the corresponding pos¬ 
terior probability p(Ci|,c) (solid curve) which takes the 
form of a logistic sigmoid. The vertical line shows the 
decision boundary for y = 0.5 which coincides with the 
point at which the two density curves cross. 


2.3 Nonlinear regression and nonlinear 
classification 


The linear regression and linear discriminant functions 
introduced in the previous section have the merit of sim¬ 
plicity, but are severely restricted in their representa¬ 
tional capabilities. A convenient way to see this is to 
consider the geometrical interpretation of these models. 
When viewed in the d-dimensional x-spaee, the linear re¬ 
gression function w T x + up is constant on hyper-planes 
which are orthogonal to the vector w. For many practi¬ 
cal applications we need to consider much more general 
classes of function. We therefore seek representations for 
nonlinear mappings which can approximate any given 
mapping to arbitrary accuracy. One way to achieve this 
is to transform the original x using a sdi? of M nonlinear 
functions where j = 1, . . ., M, and then to form a 

linear combination of these functions, so that: 


X!"'A' vix! - ( 8 ) 

i 


3 


For a sufficiently large value of M, and for a suitable 
choice of the </>j(x), such a model has the desired ‘uni¬ 
versal approximation’ properties. A familiar example, 
for the case of 1-dimensional input spaces, is the simple 
polynomial, for which the are simply successive 

powers of x and the u >'s are the polynomial coefficients. 
Models of the form in Eq. 8 have the property that they 
can be expressed as network diagrams in which there is 
a single layer of adaptive weights. 

There are a variety of families of functions in one di¬ 
mension that can approximate any continuous function 
to arbitrary accuracy. There is, however, an important 
issue which must be addressed, called the curse of di¬ 
mensionality. If, for example, we consider an M th -order 
polynomial then the number of independent coefficients 
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Figure 3: An example of a feed-forward network having 
two layers of adaptive weights. The bias parameters in 
the first layer are shown as weights from an extra input 
having a fixed value of #0 = 1. Similarly, the bias param¬ 
eters in the second layer are shown as weights from an 
extra hidden unit, with activation again fixed at zq = 1. 


grows as d M [Bishop, 1995]. For a typical medium-scale 
application with, say, 30 inputs a fourth-order polyno¬ 
mial (which is still quite restricted in its representational 
capability) would have over 46,000 adjustable parame¬ 
ters. As we shall see in Section 3.3 in order to achieve 
good generalization it is important to have more data 
points than adaptive parameters in the model, and this 
is a serious problem for methods that have a power law 
or exponential growth in the number of parameters. 

A solution to the problem lies in the fact that, for 
most real-world data sets, there are strong (often non¬ 
linear) correlations between the input variables such that 
the data does not uniformly fill the input space but is 
effectively confined to a sub-space whose dimensionality 
is called the intrinsic dimensionality of the data. We 
can take advantage of this phenomenon by considering 
again a model of the form in Eq. 8 but in which the ba¬ 
sis functions are adaptive so that they themselves 

contain weight parameters whose values can be adjusted 
in the light of the observed data set. Different models 
result from different choices for the basis functions, and 
here we consider the two most common examples. The 
first of these is called the multilayer perception (MLP) 
and is obtained by choosing the basis functions to be 
given by linear-logistic functions (Eq. 6). This leads to 
a multivariate nonlinear function that can be expressed 
in the form: 

M / d \ 

Uk (x) = ^2 w kj9 21 w i iXi + u ’j° + Wk °- ( 9 ) 

j = 1 \ i = 1 / 

Here u>jo and u>ko are bias parameters, and the basis 
functions are called hidden units. The function g(-) is 
the logistic sigmoid function of Eq. 6. This can also be 
represented as a network diagram as in Figure 3. Such a 
model is able to take account of the intrinsic dimension¬ 
ality of the data because the first-layer weights u>ji can 


adapt, and hence orient the surfaces along which the basis 
function response is constant, ft. has been demonstrated 
that models of this form can approximate to arbitrary 
accuracy any continuous function, defined on a compact 
domain, provided the number M of hidden units is suffi¬ 
ciently large. The MLP model can be extended by con¬ 
sidering several successive layers of weights. Note that 
the use of nonlinear activation functions is crucial, since 
if g(-) in Eq. 9 were replaced by the identity, the network 
would reduces to several successive linear transformations 
which would itself be linear. 

The second common network model is obtained by 
choosing the basis functions <jq(x) in Eq. 8 to be func¬ 
tions of the radial variable x — fij where fij is the center 
of the jt.h basis function, which gives rise to the radial 
basis fun ction (RBF) network model. The most common 
example uses Gaussia.ns of the form: 

<M X ) = exp j-^(x - fijfZJ^x. - A*j) J • (10) 

Here both the mean vector fij and the cova.ria.nqft matrix 
2j are considered to be adaptive parameters. The curse 
of dimensionality is alleviated because the basis func¬ 
tions can be positioned and oriented in input space such 
as to overlay the regions of high data density and hence 
to capture the nonlinear correlations between input vari¬ 
ables. Indeed, a common approach to training an RBF 
network is to use a two-stage procedure [Bishop, 1995]. 
In the first, stage the basis function parameters are de¬ 
termined using the input, data, alone, which corresponds 
to a. density estimation problem using a. mixture model 
in which the; Component, densities are given by the basis 
functions <jq(x). In the second stage the basis function 
parameters are, frozen and the second-layer weights w^j 
are found by standard least-squares optimization proce¬ 
dures. 

2.4 Decision trees 

MLP and RBF networks are often contrasted in terms 
of the support, of the basis functions that, compose them. 
MLP networks are often referred to as “global,” given 
that, linear-logistic basis functions are bounded a.wa.y 
from zero over a. significant, fraction of the input, space. 
Accordingly, in an MLP, each input, vector generally 
gives rise to a. distributed pattern over the hidden units. 
RBF networks, on the other hand, are referred to as 
“local,” due to the fact that, their Gaussian basis func¬ 
tions typically have support, over a. local region of the 
input, space. It. is important, to note, however, that lo¬ 
cal support, does not. necessarily mean non-overlapping 
support.; indeed, there is nothing in the RBF model that, 
prefers basis functions that, have non-overlapping sup¬ 
port.. A third class of model that, does focus on basis 
functions with non-overlapping support, is the decision 
tree model [Breima.n, ft a.L, 1984]. A decision tree is a. 
regression or classification model that, can be viewed as 
asking a. sequence of questions about, the input, vector. 
Each question is implemented as a. linear discriminant., 
and a. sequence of questions can be viewed as a. recursive 
partitioning of the input, space. All inputs that, arrive 
at. a. particular leaf of the tree define a. polyhedral region 






in the input space. The collection of such regions can 
be viewed as a set of basis functions. Associated with 
each basis function is an output value which (ideally) is 
close to the average value of the conditional mean (for 
regression) or discriminant function (for classification; a 
majority vote is also used). Thus the decision tree out¬ 
put can be written as a weighted sum of basis functions 
in the same manner as a layered network. 

As this discussion suggests, decision trees and 
MLP/RBF neural networks are best viewed as being 
different points along the continuum of models having 
overlapping or non-overlapping basis functions. Indeed, 
as we show in the following section, decision trees can 
be treated probabilistically as mixture models, and in 
the mixture approach the sharp discriminant function 
boundaries of classical decision trees become smoothed, 
yielding partially-overlapping basis functions. 

There are tradeoffs associated with the continuum of 
degree-of-overlap—in particular, non-overlapping basis 
functions are generally viewed as being easier to inter¬ 
pret, and better able to reject noisy input variables that 
carry little information about the output. Overlapping 
basis functions are often viewed as yielding lower vari¬ 
ance predictions and as being more robust. 

2.5 General mixture models 

The use of mixture models is not restricted to density 
estimation; rather, the mixture approach can be used 
quite generally to build complex models out of simple 
parts. To illustrate, let us consider using mixture mod¬ 
els to model a conditional density in the context of a 
regression or classification problem. A mixture model 
in this setting is referred to as a “mixtures of experts” 
model [Jacobs, et ah, 1991]. 

Suppose that we have at our disposal an elemental 
conditional model p(t|x,w). Consider a situation in 
which the conditional mean or discriminant exhibits vari¬ 
ation on a local scale that is a good match to our elemen¬ 
tal model, but the variation differs in different regions of 
the input space. We could use a more complex network 
to try to capture this global variation; alternatively we 
might wish to combine local variants of our elemental 
models in some manner. This can be achieved by defin¬ 
ing the following probabilistic mixture: 

M 

p(t|x, w) = ^p(*|x,v)p(t|x,i, Wi). (11) 

8=1 

Comparing this mixture to the unconditional mixture 
defined earlier (Eq. 2), we see that both the mixing 
proportions and the component densities are now con¬ 
ditional densities dependent on the input vector x. The 
former dependence is particularly important—we now 
view the mixing proportion p(i |x, v) as providing a prob¬ 
abilistic device for choosing different elemental models 
(“experts”) in different regions of the input space. A 
learning algorithm that chooses values for the parame¬ 
ters v as well as the values for the parameters w; can 
be viewed as attempting to find both a good partition of 
the input space and a good fit to the local models within 
that partition. 


This approach can be extended recursively by consid¬ 
ering mixtures of models where each model may itself be 
a mixture model [Jordan and Jacobs, 1994]. Such a re¬ 
cursion can be viewed as providing a probabilistic inter¬ 
pretation for the decision trees discussed in the previous 
section. We view the decisions in the decision tree as 
forming a recursive set of probabilistic selections among 
a set of models. The total probability of a target t given 
an input x is the sum across all paths down the tree: 

M M 

P(t|x, w ) = ^p(i|x,u)^p(j|x, i, V{) ■ ■ -p(t\x,i,j, . . w„...), 

2=1 j= 1 

( 12 ) 

where i and j are the decisions made at the first 
level and second level of the tree, respectively, and 
p(t|x, i, j, . . . , Wy...) is the elemental model at the leaf 
of the tree defined by the sequence of decisions. This 
probabilistic model is a conditional hierarchical mixture. 
Finding parameter values u, v;, etc. to fit this model to 
data can be viewed as finding a nested set of partitions 
of the input space and fitting a set of local models within 
the partition. 

The mixture model approach can be viewed as a spe¬ 
cial case of a general methodology known as learning 
by committee. Bishop [1995] provides a discussion of 
committees; we will also meet them in the section on 
Bayesian methods later in the chapter. 

3 Learning from Data 

The previous section has provided a selection of mod¬ 
els to choose from; we now face the problem of match¬ 
ing these models to data. In principle the problem is 
straightforward: given a family of models of interest we 
attempt to find out how probable each of these mod¬ 
els is in the light of the data. We can then select the 
most probable model (a selection rule known as maxi¬ 
mum a posteriori or MAP estimation), or we can se¬ 
lect some highly probable subset of models, weighted 
by their probability (an approach that we discuss below 
in the section on Bayesian methods). In practice there 
are a number of problems to solve, beginning with the 
specification of the family of models of interest. In the 
simplest case, in which the family can be described as a 
fixed structure with varying parameters (e.g., the class of 
feedforward MLP’s with a fixed number of hidden units), 
the learning problem is essentially one of parameter es¬ 
timation. If on the other hand the family is not easily 
viewed as a fixed parametric family (e.g., feedforward 
MLP’s with variable number of hidden units), then we 
must solve the model selection problem. 

In this section we discuss the parameter estimation 
problem. The goal will be to find MAP estimates of the 
parameters by maximizing the probability of the param¬ 
eters given the data V. We compute this probability 
using Bayes rule: 

k-ii» = *^S=>, (ij) 

where we see that to calculate MAP estimates we must 
maximize the expression in the numerator (the denomi- 



nator does not depend on w). Equivalently we can min¬ 
imize the negative logarithm of the numerator. We thus 
define the following cost function J(w): 

J(w) = — lnp(X>|w) — lnp(w), (14) 

which we wish to minimize with respect to the param¬ 
eters w. The first term in this cost function is a (neg¬ 
ative) log likelihood. If we assume that the elements in 
the training set V are conditionally independent of each 
other given the parameters, then the likelihood factorizes 
into a product form. For density estimation we have: 

N 

p('D\ w ) = n^( xn i w ) ( i5 ) 

n = 1 

and for classification and regression we have: 

N 

p( v \ w ) = If KOlXni w )- ( 16 ) 

n = 1 

In both cases this yields a log likelihood which is the sum 
of the log probabilities for each individual data point. 
For the remainder of this section we will assume this 
additive form; moreover, we will assume that the log 
prior probability of the parameters is uniform across the 
parameters and drop the second term. Thus we focus on 
maximum likelihood (ML) estimation, where we choose 
parameter values wml that maximize lnp(P|w). 

3.1 Likelihood-based cost functions 

Regression, classification and density estimation make 
different probabilistic assumptions about the form of the 
data and therefore require different cost functions. 

Eq. 3 defines a probabilistic model for regression. The 
model is a conditional density for the targets t in which 
the targets are distributed as Gaussian random variables 
(assuming Gaussian errors e) with mean values /(x). We 
now write the conditional mean as /(x, w) to make ex¬ 
plicit the dependence on the parameters w. Given the 
training set V = {x n , t n }(/ =1 , and given our assumption 
that the targets t n are sampled independently (given the 
inputs x n and the parameters w), we obtain: 

J ( W ) = TfXllI*" -/( x n, w )|| 2 , (17) 

n 

where we have assumed an identity covariance matrix 
and dropped those terms that do not depend on the 
parameters. This cost function is the standard least 
squares cost function which is traditionally used in neu¬ 
ral network training for real-valued targets. Minimiza¬ 
tion of this cost function is typically achieved via some 
form of gradient optimization, as we discuss in the fol¬ 
lowing section. 

Classification problems differ from regression prob¬ 
lems in the use of discrete-valued targets, and the like¬ 
lihood accordingly takes a different form. For binary 
classification the Bernoulli probability model p(t|x, w) = 
y t (l— J/) 1_t is natural, where we use y to denote the prob¬ 
ability p(t = l|x, w). This model yields the following log 
likelihood: 

J(w) = -^ [t n In y n + (1 - t n ) ln(l - y n )\ , (18) 


which is known as the cross entropy function. It can 
be minimized using the same generic optimization pro¬ 
cedures as are used for least squares. 

For multi-way classification problems in which there 
are C categories, where C > 2, the multinomial distri¬ 
bution is natural. Define t n such that its elements 
are one or zero according to whether the n th data point 
belongs to the i th category, and define t/ n y to be the net¬ 
work’s estimate of the posterior probability of category 
i for data point n; i.e., j/ n y = p(t n i = l|x n , w). Given 
these definitions we obtain the following cost function: 

J(w) = - t n ti In y„ t i, (19) 

n i 

which again has the form of a cross entropy. 

We now turn to density estimation as exemplified by 
Gaussian mixture modeling. The probabilistic model in 
this case is that given in Eq. 2. Assuming Gaussian 
component densities with arbitrary covariance matrices, 
we obtain the following cost function: 

J ( w ) = liyliG ex P -Ri) T V _1 ( x n ~t*i) 

n i ' 1 ' ^ 

( 20 ) 

where the parameters w are the collection of mean vec¬ 
tors fi i , the covariance matrices 2 and the mixing pro¬ 
portions 7r 8 '. A similar cost function arises for the gener¬ 
alized mixture models (cf. Eq. 12). 

3.2 Gradients of the cost function 

Once we have defined a probabilistic model, obtained a 
cost function and found an efficient procedure for calcu¬ 
lating the gradient of the cost function, the problem can 
be handed off to an optimization routine. Before dis¬ 
cussing optimization procedures, however, it is useful to 
examine the form that the gradient takes for the exam¬ 
ples that we have discussed in the previous two sections. 

The i th output unit in a layered network is endowed 
with a rule for combining the activations of units in ear¬ 
lier layers, yielding a quantity that we denote by Z{, and 
a function that converts Z{ into the output yi. For regres¬ 
sion problems, we assume linear output units such that 
yi = Z{. For binary classification problems, our earlier 
discussion showed that a natural output function is the 
logistic: yi = 1/(1 + e~ z '). For multi-way classification, 
it is possible to generalize the derivation of the logistic 
function to obtain an analogous representation for the 
multi-way posterior probabilities known as the softmax 
function [cf. Bishop, 1995]: 


yi = 



( 21 ) 


where yi represents the posterior probability of category 

i. 

If we now consider the gradient of J(w) with respect 
to Zi, it turns out that we obtain a single canonical ex¬ 
pression of the following form: 


dJ 

dw 


!>-»>&• 


n 


6 


( 22 ) 



As discussed by [Rumelhart, et al. 1995], this form for 
the gradient is predicted from the theory of Generalized 
Linear Models [McCullagh and Nelder, 1983], where it is 
shown that the linear, logistic, and softmax functions are 
(inverse) canonical links for the Gaussian, Bernoulli, and 
multinomial distributions, respectively. Canonical links 
can be found for all of the distributions in the exponen¬ 
tial family, thus providing a solid statistical foundation 
for handling a wide variety of data formats at the output 
layer of a network, including counts, time intervals and 
rates. 

The gradient of the cost function for mixture mod¬ 
els has an interesting interpretation. Taking the partial 
derivative of J(w) in Eq. 20 with respect to p,, we find: 


dJ 

dl*i 


n 


(23) 


where h n g is defined as follows: 

h _ Tig |^i |~ 1 / 2 exp{ ^(x n - n i ) T S~ 1 (x n -p 8 )| 

X2* | _1/2 exp{ —i( x „ - H k ) T S^(x n -fi k )} 

(24) 

When summed over i, the quantity h n g sums to one, 
and is often viewed as the “responsibility” or “credit” 
assigned to the i th component for the n th data point. 
Indeed, interpreting Eq. 24 using Bayes rule shows that 
h n i is the posterior probability that the n th data point 
is generated by the i th component Gaussian. A learning 
algorithm based on this gradient will move the i th mean 
fi i toward the data point x n , with the effective step size 
proportional to h n g. 

The gradient for a mixture model will always take 
the form of a weighted sum of the gradients associated 
with the component models, where the weights are the 
posterior probabilities associated with each of the com¬ 
ponents. The key computational issue is whether these 
posterior weights can be computed efficiently. For Gaus¬ 
sian mixture models, the calculation (Eq. 24) is clearly 
efficient. For decision trees there are a set of posterior 
weights associated with each of the nodes in the tree, 
and a recursion is available that computes the posterior 
probabilities in an upward sweep [Jordan and Jacobs, 
1994]. Mixture models in the form of a chain are known 
as hidden Markov models, and the calculation of the rel¬ 
evant posterior probabilities is performed via an efficient 
algorithm known as the Baum-Welch algorithm. 

For general layered network structures, a generic al¬ 
gorithm known as “backpropagation” is available to cal¬ 
culate gradient vectors [Rumelhart, et ah, 1986]. Back- 
propagation is essentially the chain rule of calculus re¬ 
alized as a graphical algorithm. As applied to layered 
networks it provides a simple and efficient method that 
calculates a gradient in O(W) time per training pattern, 
where W is the number of weights. 


3.3 Optimization algorithms 

By introducing the principle of maximum likelihood in 
Section 1, we have expressed the problem of learning in 
neural networks in terms of the minimization of a cost 
function J(w) which depends on a vector w of adaptive 
parameters. An important aspect of this problem is that 


the gradient vector Vw J can be evaluated efficiently 
(for example by backpropagation). Gradient-based min¬ 
imization is a standard problem in unconstrained non¬ 
linear optimization, for which many powerful techniques 
have been developed over the years. Such algorithms 
generally start by making an initial guess for the param¬ 
eter vector w and then iteratively updating the vector 
in a sequence of steps: 

W ( r+1 ) = w*-”- 1 + Aw*-”- 1 (25) 

where r denotes the step number. The initial parameter 
vector w(°) is often chosen at random, and the final vec¬ 
tor represents a minimum of the cost function at which 
the gradient vanishes. Due to the nonlinear nature of 
neural network models, the cost function is generally a 
highly complicated function of the parameters, and may 
possess many such minima. Different algorithms differ 
in how the update Aw( T ) is computed. 

The simplest such algorithm is called gradient descent 
and involves a parameter update which is proportional 
to the negative of the cost function gradient A = —rjVE 
where rj is a fixed constant called the learning rate. It 
should be stressed that gradient descent is a particularly 
inefficient optimization algorithm. Various modifications 
have been proposed, such as the inclusion of a momen¬ 
tum term, to try to improve its performance. In fact 
much more powerful algorithms are readily available, as 
described in standard textbooks such as [Fletcher, 1987]. 
Two of the best known are called conjugate gradients 
and guasi-Newton (or variable metric) methods. For the 
particular case of a sum-of-squares cost function, the 
Levenberg-Marguardt algorithm can also be very effec¬ 
tive. Software implementations of these algorithms are 
widely available. 

The algorithms discussed so far are called batch since 
they involve using the whole data set for each evalua¬ 
tion of the cost function or its gradient. There is also a 
stochastic or on-line version of gradient descent in which, 
for each parameter update, the cost function gradient is 
evaluated using just one of the training vectors at a time 
(which are then cycled either in order or in a random 
sequence). While this approach fails to make use of the 
power of sophisticated methods such as conjugate gradi¬ 
ents, it can prove effective for very large data sets, par¬ 
ticularly if there is significant redundancy in the data. 

3.4 Hessian matrices, error bars and pruning 

After a set of weights have been found for a neural net¬ 
work using an optimization procedure, it is often useful 
to examine second-order properties of the fitted network 
as captured in the Hessian matrix H = d 2 J/dwdw T . 
Efficient algorithms have been developed to compute the 
Hessian matrix in time 0(W 2 ) [Bishop, 1995]. As in the 
case of the calculation of the gradient by backpropaga¬ 
tion, these algorithms are based on recursive message 
passing in the network. 

One important use of the Hessian matrix lies in 
the calculation of error bars on the outputs of a net¬ 
work. If we approximate the cost function locally as 
a quadratic function of the weights (an approximation 
which is equivalent to making a Gaussian approxima¬ 
tion for the log likelihood), then the estimated variance 



3.0 


M= 1 



Figure 4: Effects of model complexity illustrated by 
modeling a mixture of two Gaussians (shown by the 
dashed curves) using a mixture of M Gaussians (shown 
by the solid curves). The results are obtained for 20 
cycles of EM. 


of the i th output t/,; can be shown to be: 

T 

H~ l 


dyi 

dw 


dyi 

dw 


(26) 


where the gradient vector Oyi/Ow can be calculated via 
ba.ckpropa.ga.t.ion. 

The Hessian matrix is also useful in pruning algo¬ 
rithms. A pruning algorithm deletes weights from a fit¬ 
ted network to yield a simpler network that may out¬ 
perform a more complex, overfitted network (see below), 
and may be easier to interpret. In this setting, the Hes¬ 
sian is used to approximate the increase in the cost, func¬ 
tion due to the deletion of a weight. A variety of such 
pruning algorithms are available [cf. Bishop, 1995]. 


3.5 Complexity control 


In previous sections we have introduced a variety of mod¬ 
els for representing probability distributions, we have 
shown how the parameters of the models can be opti¬ 
mized by maximizing the likelihood function, and we 
have outlined a number of powerful algorithms for per¬ 
forming this minimization. Before we can apply this 
framework in practice there is one more issue we need 
to address, which is that of model complexity. Consider 
the case of a mixture model given by Eq. 2. The number 
of input variables will be determined by the particular 
problem at hand. However, the number M of compo¬ 
nent. densities has yet. t.o be specified. Clearly if M is too 
small the model will be insufficiently flexible and we will 
obtain a. poor representation of the true density. What, 
is not. so obvious is that if M is too large we can also 
obtain poor results. This effect, is known as overfittmg 
and arises because we have a. data. set. of finite size. It. 
is illustrated using a. simple example of mixture density 
estimation in Figure 4. Here a. set. of 100 data, points 
in oni; dimension has been generated from a. distribution 
consisting of a. mixture of two Gaussians (shown by the 
dashed curves). This data. set. has then been fitted by 
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a. mixture of M Gaussians by use of the EM algorithm. 
We see that a. model with 1 component. (M = 1) gives a. 
poor representation of the true distribution from which 
the data, was generated, and in particular is unable to 
capture the bimoda.l aspect. For M = 2 the model gives 
a. good fit, as we expect, since the data, was itself gener¬ 
ated from a. two-component. Gaussian mixture. However, 
increasing the number of components to M = 10 gives a. 
poorer fit., even though this model contains the simpler 
models as special cases. 

The problem is a. very fundamental one and is associ¬ 
ated with the fact, that we are trying to infer an entire 
distribution function from a. finite number of data, points, 
which is necessarily an ill-posed problem. In regression 
for example there are infinitely many functions which 
will give a. perfect, fit. to the finite number of data, points. 
If the data, are noisy, however, the best, generalization 
will be obtained for a. function which does not. fit. the 
data, perfectly but. which captures the underlying func¬ 
tion from which the data, were generated. By increasing 
the flexibility of the model we are able to obtain ever 
better fits to the training data, and this is reflected in 
a. steadily increasing value for the likelihood function at 
its maximum. Our goal is to model the true underly¬ 
ing density function from which the data, was generated 
since this allows us to make the best, predictions for new 
data. We see that the best, approximation to this density 
occurs for an intermediate value of M. 

The same issue arises in connection with nonlinear 
regression and classification problems. For example, the 
number M of hidden units in an MLP network controls 
the model complexity and must, be optimized to give the 
best, generalization. In a. practical application we can 
train a. variety of different, models having different, com¬ 
plexity, and compare their generalization performance 
using an independent, validation set., and then select, the 
model with the best, generalization. In fact, the process 
of optimizing the complexity using a. validation set. can 
lead to some partial overfitting to the validation data, it¬ 
self, and so the final performance of the selected model 
should be confirmed using a. third independent, data. set. 
called a. test set.. 

Some theoretical insight, into the problem of overfit¬ 
ting can be obtained by decomposing the error into the 
sum of bias and variance terms [Gerna.n, et. ah, 1992]. A 
model which is too inflexible is unable to represent, the 
true structure in the underlying density function and this 
gives rise to a. high bias. Conversely a. model which is 
too flexible becomes tuned to the specific details of the 
particular data. set. and gives a. high variance. The best, 
generalization is obtained from the optimum t.ra.de-off of 
bias against, variance. 

As we have already remarked, the problem of infer¬ 
ring an entire distribution function from a. finite data. set. 
is fundamentally ill-posed since there are infinitely many 
solutions. The problem only becomes well-posed when 
some additional constraint, is imposed. This constraint, 
might, be that we model the data, using a. network having 
a. limited number of hidden units. Within the range of 
functions which this model can represent, there is then a. 
unique function which best, fits the data. Implicitly we 






are assuming that the underlying density function from 
which the data were drawn is relatively smooth. Instead 
of limiting the number of parameters in the model, we 
can encourage smoothness more directly using the tech¬ 
nique of regularization. This involves adding a penalty 
term SI to the original cost function J to give a total cost 
function J of the form: 

J = J + v ft (27) 

where v is called a regularization coefficient. The net¬ 
work parameters are determined by minimizing J, and 
the value of v controls the degree of influence of the 
penalty term SI. In practice SI is typically chosen to 
encourage smooth functions. The simplest example is 
called weight decay and consists of the sum of the squares 
of all the adaptive parameters in the model: 

« = ( 28 ) 

i 

Consider the effect of such a term on the MLP function 
(Eq. 9). If the weights take very small values then the 
network outputs become approximately linear functions 
of the inputs (since the sigmoidal function is approx¬ 
imately linear for small values of its argument). The 
value of v in Eq. 27 controls the effective complexity of 
the model, so that for large v the model is over-smoothed 
(corresponding to high bias) while for small v the model 
can overht (corresponding to high variance). We can 
therefore consider a network with a relatively large num¬ 
ber of hidden units and control the effective complexity 
by changing v. In practice, a suitable value for v can be 
found by seeking the value which gives the best perfor¬ 
mance on a validation set. 

The weight decay regularizer (Eq. 28) is simple to im¬ 
plement but suffers from a number of limitations. Regu¬ 
larizes used in practice may be more sophisticated and 
may contain multiple regularization coefficients [Neal, 
1994], 

Regularization methods can be justified within a gen¬ 
eral theoretical framework known as structural risk min¬ 
imization [Vapnik, 1995]. Structural risk minimization 
provides a quantitative measure of complexity known as 
the VC dimension. The theory shows that the VC di¬ 
mension predicts the difference between performance on 
a training set and performance on a test set; thus, the 
sum of log likelihood and (some function of) VC dimen¬ 
sion provides a measure of generalization performance. 
This motivates regularization methods (Eq. 27) and pro¬ 
vides some insight into possible forms for the regularizer 
ft. 

3.6 Bayesian viewpoint 

In earlier sections we discussed network training in terms 
of the minimization of a cost function derived from the 
principle of maximum a posteriori or maximum likeli¬ 
hood estimation. This approach can be seen as a par¬ 
ticular approximation to a more fundamental, and more 
powerful, framework based on Bayesian statistics. In the 
maximum likelihood approach the weights w are set to 
a specific value wml determined by minimization of a 


cost function. However, we know that there will typi¬ 
cally be other minima of the cost function which might 
give equally good results. Also, weight values close to 
wml should give results which are not too different from 
those of the maximum likelihood weights themselves. 

These effects are handled in a natural way in the 
Bayesian viewpoint, which describes the weights not in 
terms of a specific set of values, but in terms of a proba¬ 
bility distribution over all possible values. As discussed 
earlier (cf. Eq. 13), once we observe the training data set 
V we can compute the corresponding posterior distribu¬ 
tion using Bayes’ theorem, based on a prior distribution 
function p( w) (which will typically be very broad), and 
a likelihood function p(V |w): 

»(w|X>) = (29) 

The likelihood function will typically be very small ex¬ 
cept for values of w for which the network function is 
reasonably consistent with the data. Thus the posterior 
distribution p(w\T>) will be much more sharply peaked 
than the prior distribution p( w) (and will typically have 
multiple maxima). The quantity we are interested in is 
the predicted distribution of target values t for a new 
input vector x once we have observed the data set V. 
This can be expressed as an integration over the poste¬ 
rior distribution of weights of the form: 

p(t|x,X>) = J p(t\x,w)p(w\V)dw (30) 

where p(t|x, w) is the conditional probability model dis¬ 

cussed in the introduction. 

If we suppose that the posterior distribution p(w\T>) 
is sharply peaked around a single most-probable value 
wmp, then we can write Eq. 30 in the form: 

p( t|x,X>) ~ p(t|x, w MP ) J p(w\V)dw (31) 

= p(t|x, w MP ) (32) 

and so predictions can be made by fixing the weights to 
their most probable values. We can find the most proba¬ 
ble weights by maximizing the posterior distribution, or 
equivalently by minimizing its negative logarithm. Us¬ 
ing Eq. 29, we see that wmp is determined by minimiz¬ 
ing a regularized cost function of the form in Eq. 27 in 
which the negative log of the prior — lnp(w) represents 
the regularizer pft. For example, if the prior consists of 
a zero-mean Gaussian with variance p -1 then we obtain 
the weight-decay regularizer of Eq. 28. 

The posterior distribution will become sharply peaked 
when the size of the data set is large compared to the 
number of parameters in the network. For data sets 
of limited size, however, the posterior distribution has 
a finite width and this adds to the uncertainty in the 
predictions for t which can be expressed in terms of er¬ 
ror bars. Bayesian error bars can be evaluated using a 
local Gaussian approximation to the posterior distribu¬ 
tion [MacKay, 1992]. The presence of multiple maxima 
in the posterior distribution also contributes to the un¬ 
certainties in predictions. The capability to assess these 



uncertainties can play a crucial role in practical applica¬ 
tions. 

The Bayesian approach can also deal with more gen¬ 
eral problems in complexity control. This can be done 
by considering the probabilities of a set of alternative 
models, given the data set: 


p(H t \V) 


p(V\Hi)p(7ii) 

p(V) 


(33) 


Here different models can also be interpreted as different 
values of regularization parameters as these too control 
model complexity. If the models are given the same prior 
probabilities p(7ii) then they can be ranked by consider¬ 
ing the evidence p(V\Hi) which itself can be evaluated by 
integration over the model parameters w. We can simply 
select the model with the greatest probability. However, 
a full Bayesian treatment requires that we form a linear 
combination of the predictions of the models in which 
the weighting coefficients are given by the model proba¬ 
bilities. 

In general, the required integrations, such as that in 
Eq. 30, are analytically intractable. One approach is 
to approximate the posterior distribution by a Gaus¬ 
sian centered on wmp and then to linearize p(t|x,w) 
about wmp so that the integration can be performed an¬ 
alytically [MacKay, 1992]. Alternatively, sophisticated 
Monte Carlo methods can be employed to evaluate the 
integrals numerically [Neal, 1994]. An important aspect 
of the Bayesian approach is that there is no need to keep 
data aside in a validation set as is required when using 
maximum likelihood. In practical applications for which 
the quantity of available data are limited, it is found 
that a Bayesian treatment generally outperforms other 
approaches. 


3.7 Pre-processing, invariances and prior 
knowledge 


We have already seen that neural networks can approxi¬ 
mate essentially arbitrary nonlinear functional mappings 
between sets of variables. In principle we could therefore 
use a single network to transform the raw input variables 
into the required final outputs. However, in practice for 
all but the simplest problems the results of such an ap¬ 
proach can be improved upon considerably by incorpo¬ 
rating various forms of pre-processing, for reasons which 
we shall outline below. 

One of the simplest and most common forms of pre¬ 
processing consists of a simple normalization of the in¬ 
put, and possibly also target, variables. This may take 
the form of a linear rescaling of each input variable in¬ 
dependently to give it zero mean and unit variance over 
the training set. For some applications the original input 
variables may span widely different ranges. Although 
a linear rescaling of the inputs is equivalent to a dif¬ 
ferent choice of first-layer weights, in practice the opti¬ 
mization algorithm may have considerable difficulty in 
finding a satisfactory solution when typical input values 
are substantially different. Similar rescaling can be ap¬ 
plied to the output values in which case the inverse of 
the transformation needs to be applied to the network 
outputs when the network is presented with new inputs. 
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Pre-processing is also used to encode data in a suitable 
form. For example, if we have categorical variables such 
as ‘red’, ‘green’ and ‘blue’, these may be encoded using 
a l-of-3 binary representation. 

Another widely used form of pre-processing involves 
reducing the dimensionality of the input space. Such 
transformations may result in loss of information in the 
data, but the overall effect can be a significant improve¬ 
ment in performance as a consequence of the curse of 
dimensionality discussed in Section 3.5. The finite data 
set is better able to specify the required mapping in the 
lower-dimensional space. Dimensionality reduction may 
be accomplished by simply selecting a subset of the orig¬ 
inal variables, but more typically involves the construc¬ 
tion of new variables consisting of linear or nonlinear 
combinations of the original variables called features. A 
standard technique for dimensionality reduction is prin¬ 
cipal component analysis [Anderson, 1984]. Such meth¬ 
ods, however, make use only of the input data and ignore 
the target values, and can sometimes be significantly 
sub-optimal. 

Yet another form of pre-processing involves correcting 
deficiencies in the original data. A common occurrence 
is that some of the input variables are missing for some 
of the data points. Correction of this problem in a prin¬ 
cipled way requires that the probability distribution p(x) 
of input data be modeled. 

One of the most important factors determining the 
performance of real-world applications of neural net¬ 
works is the use of prior knowledge which is information 
additional to that present in the data. As an example, 
consider the problem of classifying hand-written digits 
discussed in Section 1. The most direct approach would 
be to collect a large training set of digits and to train 
a feedforward network to map from the input image to 
a set of 10 output values representing posterior prob¬ 
abilities for the 10 classes. However, we know that the 
classification of a digit should be independent of its posi¬ 
tion within the input image. One way of achieving such 
translation invariance is to make use of the technique 
of shared weights. This involves a network architecure 
having many hidden layers in which each unit takes in¬ 
puts only from a small patch, called a receptive field, of 
units in the previous layer. By a process of constrain¬ 
ing neighboring units to have common weights, it can be 
arranged that the output of the network is insensitive 
to translations of the input image. A further benefit of 
weight sharing is that the number of independent param¬ 
eters is much smaller than the number of weights, which 
assists with the problem of model complexity. This ap¬ 
proach is the basis for the highly successful US postal 
code recognition system of [LeCun, et ah, 1989]. An 
alternative to shared weights is to enlargen the training 
set artificially by generating “virtual examples” based on 
applying translations and other transformations to the 
original training set [Poggio and Vetter, 1992]. 

4 Graphical models 

Neural networks express relationships between variables 
by utilizing the representational language of graph the¬ 
ory. Variables are associated with nodes in a graph 
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Figure 5: (a) An undirected graph in which A',; is inde¬ 
pendent of Xj given AA and A';, and AA is independent 
of A'; given A',; and Xj . (b) A directed graph in which A',; 
and A'*,, are marginally independent but are conditionally 
dependent given Xj. 
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Figure 6: (a) A directed graph representation of an 

HMM. Each horizontal link is associated with the tran¬ 
sition matrix A, and each vertical link is associated with 
the emission matrix B. (b) An HMM as a Boltzmann 
machine. The parameters on the horizontal links are log¬ 
arithms of the entries of the A matrix, and the parame¬ 
ters on the vertical links are logarithms of the. entries of 
the B matrix. The two representations yield the same 
joint probability distribution. 


and transformations of variables are based on algorithms 
that propagate numerical messages along the links of the 
graph. Moreover, the graphs are often accompanied by 
probabilistic interpretations of the variables and their 
interrelationships. As we have seen, such probabilistic 
interpretations allow a neural network to be understood 
as a form of probabilistic model, and reduce the prob¬ 
lem of learning the weights of a network to a problem in 
statistics. 

Related graphical models have been studied through¬ 
out statistics, engineering and AI in recent years. Hidden 
Markov models, Kalman filters, and path analysis mod¬ 
els are all examples of graphical probabilistic models that 
can be fitted to data and used to make inferences. The 
relationship between these models and neural networks 
is rather strong; indeed it is often possible to reduce one 
kind of model to the other. In this section, we examine 
these relationships in some detail and provide a broader 
characterization of neural networks as members of a gen¬ 
eral family of graphical probabilistic models. 

Many interesting relationships have been discovered 
between graphs and probability distributions [Spiegel- 
halter, et. ah, 1993]; [Pearl, 1988]. These relationships 
derive from the use of graphs to represent conditional in¬ 
dependencies among random variables. In an undirected 
graph, there is a direct correspondence between con¬ 
ditional independence and graph separation—random 
variables A',; and X^ are conditionally independent given 
Xj if nodes A',; and AA are separated by node Xj (we 
use the symbol “A',;” to represent both a random vari¬ 
able and a node in a graph). This statement remains 
true for sets of nodes (see Figure 5(a)). Directed graphs 
have a somewhat different semantics, due to the abil¬ 
ity of directed graphs to represent “induced dependen¬ 
cies.” An induced dependency is a situation in which two 
nodes which are marginally independent become condi¬ 
tionally dependent given the value of a third node (see 
Figure 5(b)). Suppose, for example, that A',; and X^ 
represent independent coin tosses, and Xj represents the 
sum of A 'i and X^. Then A',; and X^ are marginally inde¬ 
pendent but are conditionally dependent given Xj . The 


semantics of independence in directed graphs is captured 
by a graphical criterion known as d-separation [Pearl, 
1988], which differs from undirected separation only in 
those cases in which paths have two arrows arriving at 
the same node (as in Figure 5(b)). 

Although the neural network architectures that we 
have discussed until now have all been based on di¬ 
rected graphs, undirected graphs also play an impor¬ 
tant role in neural network research. Constraint sat¬ 
isfaction architectures, including the Hopheld network 
[Hopheld, 1982] and the Boltzmann machine [Hinton and 
Sejnowski, 1986], are the most prominent examples. A 
Boltzmann machine is an undirected probabilistic graph 
that respects the conditional independency semantics de¬ 
scribed above (cf. Figure 5(a)). Each node in a Boltz¬ 
mann machine is a binary-valued random variable A',; 
(or more generally a discrete-valued random variable). 
A probability distribution on the ‘2 N possible configura¬ 
tions of such variables is defined via an energy function 
E. Let .Jjj be the weight on the link between A',; and Xj, 
let .Jij = .Jji, let a index the configurations, and define 
the energy of configuration a as follows: 

/. . \ Y • (34) 

i<j 

The probability of configuration a is then defined via the 
Boltzmann distribution: 

e -E a /T 

Pa = ^ _ E , T , (35) 

E 7 e '' 

where the temperature T provides a scale for the energy. 

An example of a directed probabilistic graph is the 
hidden Markov model (HMM). An HMM is defined by 
a set of state variables Hi, where i is generally a time or 
a space index, a set of output variables O,;, a probability 
transition matrix A = p(Ff,;|Ff,;_i), and an emission ma¬ 
trix B = p(Oi\Hi). The directed graph for an HMM is 
shown in Figure 6(a). As can be seen from considering 
the separatory properties of the graph, the conditional 
independencies of the HMM are defined by the following 



Markov conditions: 

Hi J- {ffi, Oi,..., Hi. 2 , Oi. 2 , 


2 < i< N 
(36) 

and 

O f J- {ffi, Oi,..., Hi- 1, Os.;, | //. 2 < * < #, 

(3 ? ) 

where the symbol _L is used to denote independence. 

Figure 6(b) shows that it is possible to treat an HMM 
as a special case of a Boltzmann machine [Luttrell, 1989]; 
[Saul and Jordan, 1995]. The probabilistic structure of 
the HMM can be captured by defining the weights on the 
links as the logarithms of the corresponding transition 
and emission probabilities. The Boltzmann distribution 
(Eq. 35) then converts the additive energy into the prod¬ 
uct form of the standard HMM probabilility distribution. 
As we will sed, this reduction of a directed graph to an 
undirected graph is a recurring theme in the graphical 
model formalism. 

General mixture models are readily viewed as graph¬ 
ical models [Buntine, 1994]. For example, the uncon¬ 
ditional mixture model of Eq. 2 can be represented as 
a graphical model with two nodes—a multinomial “hid¬ 
den” node which represents the selected component, a 
“visible” node representing x, and a directed link from 
the hidden node to the visible node (see below for the 
hidden/visible distinction). Conditional mixture models 
[Jacobs, et ah, 1991] simply require another visible node 
with directed links to the hidden node and the visible 
nodes. Hierarchical conditional mixture models [Jordan 
and Jacobs, 1994] require a chain of hidden nodes, one 
hidden node foifiach level of the tree. 

Within the general framework of probabilistic graph¬ 
ical models, it is possible to tackle general problems of 
inference and learning. The key problem that arises in 
this setting is the problem of computing the probabilities 
of certain nodes, which we will refer to as hidden nodes , 
given the observed values of other nodes, which we will 
refer to as visible nodes. For example, in an HMM, the 
variables Oi are generally treated as visible, and it is de¬ 
sired to calculate a probability distribution on the hidden 
states Hi. A similar inferential calculation is required in 
the mixture models and the Boltzmann machine. 

Generic algorithms have been developed to solve the 
influential problem of the calculation of posterior prob¬ 
abilities in graphs. Although a variety of inference al¬ 
gorithms have been developed, they can all be viewed 
as essentially the same underlying algorithm [Shachter, 
Andersen, and Szolovit.s, 1994]. Let us consider undi¬ 
rected graphs. A special case of an undirected graph 
is a triangulated graph [Spiegelhalter, et ah, 1993], in 
which any cycle having four or more nodes has a chord. 
For example,, the graph in Figure 5(a) is not triangu¬ 
lated, but becomes triangulated when a link is added 
between nodes A',; and Xj. In a triangulated graph, the 
cliques of the graph can be arranged in the form of a 
junction tree , which is a tree having the property that 
any node that appears in two different cliques in the tree 
also appears in every clique on the path that links the 
two cliques (the “running intersection property”). This 
cannot be achieved in non-triangulated graphs. For ex¬ 
ample, the cliques in Figure 5(a) are {A',;, A/.}, {A/., A/}, 



(c) (d) 


Figure 7: The basic structure of the junction tree algo¬ 
rithm for undirected graphs. The graph in (a) is first 
triangulated (b), then the cliques are identified (c), and 
arranged into a tree (d). Products of potential functions 
on the nodes in (d) yield probability distributions on the 
nodes in (a). 


{A/-, A';}, and it is not possible to arrange these cliques 
into a tree that obeys the running intersection property. 
If a chord is added the resulting cliques are {A',;, Xj , A/.} 
and {A',;, A/, A';}, and these; cliques can be arranged as 
a simple chain that trivially obeys the running intersect 
t.ion property. In general, it turns out that the proba¬ 
bility distributions corresponding to triangulated graphs 
can be characterized as decomposable , which implies that 
they can be factorized into a product of local functions 
(“potentials”) associated with the cliques in the trian¬ 
gulated graph. 1 The calculation of posterior probabil¬ 
ities in decomposable distributions is straightforward, 
and can be achieved via a local message-passing algo¬ 
rithm on the junction tree [Spiegelhalter, qt al., 1993]. 

Graphs that are not triangulated can be turned into 
triangulated graphs by the addition of links. If the po¬ 
tentials on the new graph are defined suitably as prod¬ 
ucts of potentials on the original graph, then the inde¬ 
pendencies in the original graph are preserved. This im¬ 
plies that the algorithms for triangulated graphs can be 
used for all undirected graphs; an untriangulated graph 
is first triangulated (see Figure 7). Moreover, it is possi¬ 
ble to convert directed graphs to undirected graphs in a 
manner that preserves the probabilistic structure of the 
original graph [Spiegelhalter, et al., 1993]. This implies 
that the junction tree algorithm is indeed generic; it can 
be applied to any graphical model. 

The problem of calculating posterior probabilities on 
graphs is NP-hard; thus, a major issue in the use of 
the inference algorithms is the identification of cases 
in which they are efficient. Chain structures such as 
HMM’s yield efficient algorithms, and indeed the clas¬ 
sical forward-backward algorithm for HMM’s is a spe¬ 
cial, efficient case of the junction tree algorithm [Smyth, 
Heckerman, and Jordan, 1996]. Decision tree structures 
such as the hierarchical mixture of experts yield efficient 


J An interesting example is a Boltzmann machine on a tri¬ 
angulated graph. The potentials are products of exp( 
factors, where the product is taken over all (*, j) pairs in a 
particular clique. Given that the product across potentials 
must be the joint probability, this implies that the partition 
function (the denominator of Eq. 35) must be unity in this 
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algorithms, and the recursive posterior probability cal¬ 
culation of [Jordan and Jacobs, 1994] described earlier is 
also a special case of the junction tree algorithm. All of 
the simpler mixture model calculations described earlier 
are therefore also special cases. Another interesting spe¬ 
cial case is the state estimation algorithm of the Kalman 
filter [Shachter and Kenley, 1989]. Finally, there are a 
variety of special cases of the Boltzmann machine which 
are amenable to the exact calculations of the junction 
tree algorithm [Saul and Jordan, 1995]. 

For graphs that are outside of the tractable categories 
of trees and chains, the junction tree algorithm often per¬ 
forms surprisingly well, but for highly connected graphs 
the algorithm can be too slow. In such cases, approx¬ 
imate algorithms such as Gibbs sampling are utilized. 
A virtue of the graphical framework is that Gibbs sam¬ 
pling has a generic form, which is based on the notion 
of a Markov boundary [Pearl, 1988]. A special case of 
this generic form is the stochastic update rule for gen¬ 
eral Boltzmann machines. 

Our discussion has emphasized the unifying frame¬ 
work of graphical models both for expressing probabilis¬ 
tic dependencies in graphs and for describing algorithms 
that perform the inferential step of calculating posterior 
probabilities on these graphs. The unification goes fur¬ 
ther, however, when we consider learning. A generic 
methodology known as the Expectation-Maximization 
(EM) algorithm is available for MAP and Bayesian es¬ 
timation in graphical models [Dempster, Laird, and Ru¬ 
bin, 1977]. EM is an iterative method, based on two 
alternating steps: an E step, in which the values of hid¬ 
den variables are estimated, based on the current values 
of the parameters and the values of visible variables, and 
an M step, in which the parameters are updated based on 
the estimated values obtained from the E step. Within 
the framework of the EM algorithm, the junction tree al¬ 
gorithm can readily be viewed as providing a generic E 
step. Moreover, once the estimated values of the hidden 
nodes are obtained from the E step, the graph can be 
viewed as fully observed, and the M step is a standard 
MAP or ML problem. The standard algorithms for all 
of the tractable architectures described above (mixtures, 
trees and chains) are in fact instances of this general 
graphical EM algorithm, and the learning algorithm for 
general Boltzmann machines is a special case of a gener¬ 
alization of EM known as GEM [Dempster, et ah, 1977]. 

What about the case of feedforward neural networks 
such as the multilayer perceptron? It is in fact possible 
to associate binary hidden values with the hidden units 
of such a network (cf. our earlier discussion of the logis¬ 
tic function; see also [Amari, 1995]) and apply the EM 
algorithm directly. For N hidden units, however, there 
are 2 N patterns whose probabilities must be calculated 
in the E step. For large N, this is an intractable compu¬ 
tation, and recent research has therefore begun to focus 
on fast methods for approximating these distributions 
[Hinton, et ah, 1995]; [Saul, et ah, 1996]. 

Acknowledgements: We want to thank Peter Dayan 
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Further information 

In this chapter we have emphasized the links between 
neural networks and statistical pattern recognition. A 
more extensive treatment from the same perspective can 
be found in [Bishop, 1995]. For a view of recent research 
in the field, the proceedings of the annual NIPS (Neural 
Information Processing Systems; MIT Press) conferences 
are highly recommended. 

Neural computing is now a very broad field and there 
are many topics which have not been discussed for lack 
of space. Here we aim to provide a brief overview of some 
of the more significant omissions, and to give pointers to 
the literature. 

The resurgence of interest in neural networks during 
the 1980’s was due in large part to work on the statistical 
mechanics of fully connected networks having symmetric 
connections (i.e. if unit i sends a connection to unit j 
then there is also a connection from unit j back to unit i 
with the same weight value). We have briefly discussed 
such systems; a more extensive introduction to this area 
can be found in [Hertz, et ah, 1991]. 

The implementation of neural networks in specialist 
VLSI hardware has been the focus of much research, al¬ 
though by far the majority of work in neural computing 
is undertaken using software implementations running 
on standard platforms. 

An implicit assumption throughout most of this chap¬ 
ter is that the processes which give rise to the data are 
stationary in time. The techniques discussed here can 
readily be applied to problems such as time series fore¬ 
casting, provided this stationarity assumption is valid. 
If, however, the generator of the data is itself evolving 
with time then more sophisticated techniques must be 
used, and these are the focus of much current research 
[see Bengio, 1996]. 

One of the original motivations for neural networks 
was as models of information processing in biological sys¬ 
tems such as the human brain. This remains the subject 
of considerable research activity, and there is a continu¬ 
ing flow of ideas between the fields of neurobiology and 
of artificial neural networks. Another historical spring¬ 
board for neural network concepts was that of adaptive 
control, and again this remains a subject of great inter¬ 
est. 

Defining terms 

Classification A learning problem in which the goal 
is to assign input vectors to one of a number of 
(usually mutually exclusive) classes. 

Boltzmann machine An undirected network of dis¬ 
crete valued random variables, where an energy 
function is associated with each of the links, and 
for which a probability distribution is defined by 
the Boltzmann distribution. 

Cost function A function of the adaptive parameters 
of a model whose minimum is used to define suit¬ 
able values for those parameters. It may consist of 
a likelihood function and additional terms. 



Decision tree A network that performs a sequence of 
classificatory decisions on an input vector and pro¬ 
duces an output vector that is conditional on the 
outcome of the decision sequence. 

Density estimation The problem of modeling a prob¬ 
ability distribution from a finite set of examples 
drawn from that distribution. 

Discriminant function A function of the input vector 
which can be used to assign inputs to classes in a 
classification problem. 

Hidden Markov model A graphical 

probabilistic model characterized by a state vec¬ 
tor, an output vector, a state transition matrix, an 
emission matrix and an initial state distribution. 

Likelihood function The probability of observing a 
particular data set under the assumption of a given 
parametrized model, expressed as a function of the 
adaptive parameters of the model. 

Mixture model A probability model which consists of 
a linear combination of simpler component proba¬ 
bility models. 

Multilayer perceptron The most common form of 
neural network model, consisting of successive lin¬ 
ear transformations followed by processing with 
nonlinear activation functions. 

Overfitting The problem in which a model which is too 
complex captures too much of the noise in the data, 
leading to poor generalization. 

Radial basis function network A common network 
model consisting of a linear combination of basis 
functions each of which is a function of the differ¬ 
ence between the input vector and a center vector. 

Regression A learning problem in which the goal is 
to map each input vector to a real-valued output 
vector. 

Regularization A technique for controlling model 
complexity and improving generalization by the ad¬ 
dition of a penalty term to the cost function. 

VC dimension A measure of the complexity of a 
model. Knowledge of the VC dimension permits 
an estimate to be made of the difference between 
performance on the training set and performance 
on a test set. 
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